 model similarity


48237d9f2dea8c74c2a72126cf63d933-AuthorFeedback.pdf

Neural Information Processing Systems

Our analysis helps illustrate why this worst-case behavior does not attain in practice. The standard union bound does not take advantage of model similarity; the refinement in equation (3) is the basis for our calculations that highlight the beneficial effect of model similarity. Our bounds demonstrate a link between similarity and protection against overfitting in both the non-adaptive and adaptive case. We agree an important direction for future work is exploring the extent to which our findings transfer to other settings. A concrete next step is evaluating model similarity on data from Kaggle competitions, which includes a diverse set of sample sizes, model classes, and data types.



Jailbreak Transferability Emerges from Shared Representations

Angell, Rico, Brinkmann, Jannik, He, He

arXiv.org Artificial Intelligence

Jailbreak transferability is the surprising phenomenon in which an adversarial attack compromising one model also elicits harmful responses from other models. Despite widespread demonstrations, there is little consensus on why transfer is possible: is it a quirk of safety training, an artifact of model families, or a more fundamental property of representation learning? We present evidence that transferability emerges from shared representations rather than incidental flaws. Across 20 open-weight models and 33 jailbreak attacks, we find two factors that systematically shape transfer: (1) representational similarity under benign prompts, and (2) the strength of the jailbreak on the source model. To move beyond correlation, we show that deliberately increasing similarity through benign-only distillation causally increases transfer. Our qualitative analyses reveal systematic transferability patterns across different types of jailbreaks. For example, persona-style jailbreaks transfer far more often than cipher-based prompts, consistent with the idea that natural-language attacks exploit models' shared representation space, whereas cipher-based attacks rely on idiosyncratic quirks that do not generalize. Together, these results reframe jailbreak transfer as a consequence of representation alignment rather than a fragile byproduct of safety training.




How Benchmark Prediction from Fewer Data Misses the Mark

Zhang, Guanhua, Dorner, Florian E., Hardt, Moritz

arXiv.org Artificial Intelligence

Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM evaluation) aims to select a small subset of evaluation points and predict overall benchmark performance from that subset. In this paper, we systematically assess the strengths and limitations of 11 benchmark prediction methods across 19 diverse benchmarks. First, we identify a highly competitive baseline: Take a random sample and fit a regression model on the sample to predict missing entries. Outperforming most existing methods, this baseline challenges the assumption that careful subset selection is necessary for benchmark prediction. Second, we discover that all existing methods crucially depend on model similarity. They work best when interpolating scores among similar models. The effectiveness of benchmark prediction sharply declines when new models have higher accuracy than previously seen models. In this setting of extrapolation, none of the previous methods consistently beat a simple average over random samples. To improve over the sample average, we introduce a new method inspired by augmented inverse propensity weighting. This method consistently outperforms the random sample average even for extrapolation. However, its performance still relies on model similarity and the gains are modest in general. This shows that benchmark prediction fails just when it is most needed: at the evaluation frontier, where the goal is to evaluate new models of unknown capabilities.
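The two simple baselines mentioned in this abstract (a random-sample average, and a regression fitted on a random sample) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the ridge penalty, and the synthetic score matrix are all hypothetical, and the abstract does not say which regression model the paper uses.

```python
import numpy as np


def sample_mean_estimate(target_subset_scores):
    """Estimate full-benchmark accuracy as the mean over the sampled items."""
    return float(np.mean(target_subset_scores))


def regression_estimate(source_scores, subset_idx, target_subset_scores, ridge=1.0):
    """Ridge regression from subset item scores to full-benchmark accuracy.

    source_scores: (n_models, n_items) 0/1 matrix for fully evaluated models.
    subset_idx: indices of the randomly sampled items.
    target_subset_scores: the new model's 0/1 scores on those items.
    The ridge penalty is an assumption made for numerical stability.
    """
    source_scores = np.asarray(source_scores, dtype=float)
    X = source_scores[:, subset_idx]       # features: scores on sampled items
    y = source_scores.mean(axis=1)         # target: accuracy on all items
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    k = Xc.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + ridge * np.eye(k), Xc.T @ (y - y_mean))
    x_new = np.asarray(target_subset_scores, dtype=float)
    return float(y_mean + (x_new - x_mean) @ w)


# Synthetic demo data (hypothetical; not taken from any benchmark).
rng = np.random.default_rng(0)
n_models, n_items, k = 50, 1000, 50
ability = rng.uniform(0.3, 0.9, size=n_models)
difficulty = rng.uniform(-0.2, 0.2, size=n_items)
probs = np.clip(ability[:, None] - difficulty[None, :], 0.0, 1.0)
scores = (rng.random((n_models, n_items)) < probs).astype(float)

source, target = scores[:-1], scores[-1]   # last model plays the "new" model
subset_idx = rng.choice(n_items, size=k, replace=False)

est_mean = sample_mean_estimate(target[subset_idx])
est_reg = regression_estimate(source, subset_idx, target[subset_idx])
true_acc = float(target.mean())
print(f"true={true_acc:.3f} sample_mean={est_mean:.3f} regression={est_reg:.3f}")
```

The regression can only exploit patterns present among the source models, which is one way to see why such predictors depend on the new model resembling previously seen ones.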


FedSAUC: A Similarity-Aware Update Control for Communication-Efficient Federated Learning in Edge Computing

Lee, Ming-Lun, Chou, Han-Chang, Chen, Yan-Ann

arXiv.org Artificial Intelligence

Federated learning is a distributed machine learning framework to collaboratively train a global model without uploading privacy-sensitive data onto a centralized server. Usually, this framework is applied to edge devices such as smartphones, wearable devices, and Internet of Things (IoT) devices which closely collect information from users. However, these devices are mostly battery-powered. The update procedure of federated learning will constantly consume the battery power and the transmission bandwidth. In this work, we propose an update control for federated learning, FedSAUC, by considering the similarity of users' behaviors (models). At the server side, we exploit clustering algorithms to group devices with similar models. Then we select some representatives for each cluster to update information to train the model. We also implemented a testbed prototype on edge devices to validate the performance. The experimental results show that this update control will not affect the training accuracy in the long run.
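The server-side idea described here (group devices with similar models, then pick representatives per cluster) might look roughly like the sketch below. It assumes plain k-means on flattened weight vectors and one representative per cluster, chosen as the device nearest the centroid; the abstract specifies neither the clustering algorithm nor the selection rule, so both choices and all names are hypothetical.

```python
import numpy as np


def kmeans(points, k, n_iter=20, seed=0):
    """Plain k-means on device weight vectors (illustrative choice only)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:          # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment against the final centroids.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids


def select_representatives(points, labels, centroids):
    """Pick, per cluster, the device whose weights lie closest to the centroid."""
    points = np.asarray(points, dtype=float)
    reps = {}
    for j in range(len(centroids)):
        idx = np.flatnonzero(labels == j)
        if idx.size == 0:
            continue
        d = np.linalg.norm(points[idx] - centroids[j], axis=1)
        reps[j] = int(idx[d.argmin()])
    return reps


# Six hypothetical devices whose flattened weights form two obvious groups.
device_weights = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [10.0, 10.0], [10.1, 10.0], [10.0, 10.1],
])
labels, centroids = kmeans(device_weights, k=2)
representatives = select_representatives(device_weights, labels, centroids)
print(labels, representatives)
```

Only the representative devices would then upload updates in a given round, which is where the battery and bandwidth savings would come from.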


Great Models Think Alike and this Undermines AI Oversight

Goel, Shashwat, Struber, Joschka, Auzina, Ilze Amanda, Chandra, Karuna K, Kumaraguru, Ponnurangam, Kiela, Douwe, Prabhu, Ameya, Bethge, Matthias, Geiping, Jonas

arXiv.org Artificial Intelligence

As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as "AI Oversight". We study how model similarity affects both aspects of AI oversight by proposing a probabilistic metric for LM similarity based on overlap in model mistakes. Using this metric, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from "weak-to-strong generalization". As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend -- model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.
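To make "similarity based on overlap in model mistakes" concrete, here is a small sketch of a chance-adjusted agreement score in the style of Cohen's kappa, computed on per-item correctness. It is not the paper's probabilistic metric, whose formula the abstract does not give; the function name and the handling of the degenerate case are assumptions.

```python
import numpy as np


def error_consistency(correct_a, correct_b):
    """Chance-adjusted agreement between two models' per-item correctness.

    Returns 1.0 when the models are right and wrong on exactly the same items,
    about 0.0 when their mistakes overlap no more than independent models with
    the same accuracies would, and negative values when they overlap less.
    """
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("correctness vectors must have the same shape")
    observed = np.mean(a == b)                 # both right or both wrong
    acc_a, acc_b = a.mean(), b.mean()
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)  # if errors were independent
    if np.isclose(expected, 1.0):
        return 0.0                             # undefined case; 0.0 is an assumption
    return float((observed - expected) / (1 - expected))


same = error_consistency([1, 0, 1, 0], [1, 0, 1, 0])
opposite = error_consistency([1, 0, 1, 0], [0, 1, 0, 1])
unrelated = error_consistency([1, 1, 0, 0], [1, 0, 1, 0])
print(same, opposite, unrelated)
```

Adjusting for chance matters because two highly accurate models agree on most items simply by both being correct, so raw agreement overstates how similar their mistakes are.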


FedAC: An Adaptive Clustered Federated Learning Framework for Heterogeneous Data

Zhang, Yuxin, Chen, Haoyu, Lin, Zheng, Chen, Zhe, Zhao, Jin

arXiv.org Artificial Intelligence

Clustered federated learning (CFL) is proposed to mitigate the performance deterioration stemming from data heterogeneity in federated learning (FL) by grouping similar clients for cluster-wise model training. However, current CFL methods struggle due to inadequate integration of global and intra-cluster knowledge and the absence of an efficient online model similarity metric, while treating the cluster count as a fixed hyperparameter limits flexibility and robustness. In this paper, we propose an adaptive CFL framework, named FedAC, which (1) efficiently integrates global knowledge into intra-cluster learning by decoupling neural networks and utilizing distinct aggregation methods for each submodule, significantly enhancing performance; (2) includes a cost-effective online model similarity metric based on dimensionality reduction; (3) incorporates a cluster number fine-tuning module for improved adaptability and scalability in complex, heterogeneous environments. Extensive experiments show that FedAC achieves superior empirical performance, increasing the test accuracy by around 1.82% and 12.67% on CIFAR-10 and CIFAR-100 datasets, respectively, under different non-IID settings compared to SOTA methods.


Model Similarity Mitigates Test Set Overuse

Mania, Horia, Miller, John, Schmidt, Ludwig, Hardt, Moritz, Recht, Benjamin

arXiv.org Machine Learning

Excessive reuse of test data has become commonplace in today's machine learning workflows. Popular benchmarks, competitions, industrial scale tuning, among other applications, all involve test data reuse beyond guidance by statistical confidence bounds. Nonetheless, recent replication studies give evidence that popular benchmarks continue to support progress despite years of extensive reuse. We proffer a new explanation for the apparent longevity of test data: Many proposed models are similar in their predictions and we prove that this similarity mitigates overfitting. Specifically, we show empirically that models proposed for the ImageNet ILSVRC benchmark agree in their predictions well beyond what we can conclude from their accuracy levels alone. Likewise, models created by large scale hyperparameter search enjoy high levels of similarity. Motivated by these empirical observations, we give a non-asymptotic generalization bound that takes similarity into account, leading to meaningful confidence bounds in practical settings.
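For context on the "statistical confidence bounds" mentioned above, the sketch below computes the standard Hoeffding-plus-union-bound interval width for k models fixed in advance and evaluated on the same n test points. This is the textbook baseline that ignores similarity, not the paper's similarity-aware bound, which the abstract does not state; the function name and default confidence level are assumptions.

```python
import math


def union_bound_width(n_test, n_models, delta=0.05):
    """Half-width eps such that, with probability at least 1 - delta, every one
    of n_models fixed (non-adaptively chosen) models has test accuracy within
    eps of its population accuracy.

    Derivation: Hoeffding gives P(|err_hat - err| >= eps) <= 2 exp(-2 n eps^2)
    per model; a union bound over k models multiplies this by k; setting the
    result equal to delta and solving for eps gives the formula below.
    """
    if n_test <= 0 or n_models <= 0 or not (0.0 < delta < 1.0):
        raise ValueError("need n_test > 0, n_models > 0 and 0 < delta < 1")
    return math.sqrt(math.log(2 * n_models / delta) / (2 * n_test))


# Hypothetical sizes chosen only to show how the width grows with model count.
for k in (1, 100, 10_000):
    print(k, round(union_bound_width(n_test=10_000, n_models=k), 4))
```

The width grows with the logarithm of the number of models because the union bound treats every model as a potentially independent source of error, which is the worst case when models largely agree in their predictions.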